Shape Anchor Guided Holistic Indoor Scene Understanding
This paper proposes a shape anchor guided learning strategy (AncLearn) for
robust holistic indoor scene understanding. We observe that the search space
constructed by current methods for proposal feature grouping and instance point
sampling often introduces massive noise to instance detection and mesh
reconstruction. Accordingly, we develop AncLearn to generate anchors that
dynamically fit instance surfaces to (i) separate noise from target-related
features, yielding reliable proposals at the detection stage, and (ii) reduce
outliers in object point sampling, directly providing well-structured
geometry priors without segmentation during reconstruction. We embed AncLearn
into a reconstruction-from-detection learning system (AncRec) to generate
high-quality semantic scene models in a purely instance-oriented manner.
Experiments conducted on the challenging ScanNetv2 dataset demonstrate that our
shape anchor-based method consistently achieves state-of-the-art performance in
terms of 3D object detection, layout estimation, and shape reconstruction. The
code will be available at https://github.com/Geo-Tell/AncRec.
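As a rough illustration of the anchor-guided sampling idea described above (this is not the authors' implementation; the function name, the fixed radius threshold, and the toy data are all hypothetical), the sketch below keeps only candidate instance points that lie near a set of anchors assumed to sit on the instance surface, discarding outliers before reconstruction:

```python
import numpy as np

def anchor_guided_sampling(points, anchors, radius=0.2):
    """Keep only points within `radius` of at least one anchor.

    points:  (N, 3) candidate instance points (may contain outliers)
    anchors: (M, 3) anchors assumed to lie on the instance surface
    """
    # Pairwise distances between every point and every anchor: (N, M)
    dists = np.linalg.norm(points[:, None, :] - anchors[None, :, :], axis=-1)
    # A point survives if it is close to at least one anchor
    mask = dists.min(axis=1) <= radius
    return points[mask]

# Toy example: two points near the origin plus one far-away outlier
points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0]])
anchors = np.array([[0.05, 0.0, 0.0]])
clean = anchor_guided_sampling(points, anchors)
print(len(clean))  # prints 2 -- the outlier at (5, 5, 5) is dropped
```

In the paper the anchors are learned and fit each instance dynamically; the fixed-radius test here only stands in for that learned association.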
Dynamic Contrastive Distillation for Image-Text Retrieval
Although vision-and-language pretraining (VLP)-equipped cross-modal image-text retrieval (ITR) has achieved remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to cross-modal tasks, because limited GPU memory prevents optimizing over many negative samples when handling cross-modal fusion features; 2) it is inefficient to optimize the student network statically, since hard samples of different difficulty affect distillation learning and student optimization differently. We overcome these challenges in two ways. First, to achieve multi-modal contrastive learning while balancing training cost and effect, we propose using a teacher network to estimate difficult samples for the student, so that the student absorbs powerful knowledge from the pre-trained teacher and masters the knowledge in hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation, which adapts to samples of varying difficulty so as to better balance the difficulty of the transferred knowledge against the student's self-learning ability. We successfully apply our DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129× compared to existing ITR models. We further provide in-depth analyses and discussions explaining where the performance improvement comes from.
We hope our work can shed light on other tasks that require distillation and contrastive learning.
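To make the teacher-weighted distillation idea concrete, here is a minimal, hypothetical NumPy sketch (not the paper's code; the function names, the weighting scheme, and the temperature value are all assumptions): the teacher's in-batch image-text similarities define soft targets, each pair's weight grows with how hard the teacher finds its true match, and the student is penalized by a weighted cross-entropy against those targets:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def dcd_loss(s_img, s_txt, t_img, t_txt, tau=0.05):
    """Toy contrastive-distillation loss with teacher-estimated difficulty.

    s_*: student image/text embeddings, shape (B, D)
    t_*: teacher image/text embeddings, shape (B, D)
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    # In-batch image-to-text similarity matrices, shape (B, B)
    p_student = softmax(s_img @ s_txt.T / tau)
    p_teacher = softmax(t_img @ t_txt.T / tau)
    # Difficulty per pair: low teacher confidence on the true match = hard
    diag = np.arange(len(s_img))
    difficulty = 1.0 - p_teacher[diag, diag]   # in [0, 1]
    weights = 1.0 + difficulty                 # hard pairs count more
    # Cross-entropy from the teacher's soft targets to the student
    ce = -(p_teacher * np.log(p_student + 1e-9)).sum(axis=1)
    return float((weights * ce).mean())

# Usage with random normalized embeddings (batch of 4, dim 8)
rng = np.random.default_rng(0)
s_img, s_txt = l2_normalize(rng.normal(size=(4, 8))), l2_normalize(rng.normal(size=(4, 8)))
t_img, t_txt = l2_normalize(rng.normal(size=(4, 8))), l2_normalize(rng.normal(size=(4, 8)))
loss = dcd_loss(s_img, s_txt, t_img, t_txt)  # a positive scalar
```

Weighting by the teacher's own uncertainty is one simple way to realize "dynamic" distillation: easy pairs (teacher already confident) contribute near baseline weight, while hard pairs are emphasized.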